Data Flow & Collection Relationships
This document explains the data flow patterns and collection relationships in the notification system. It traces how data enters the system from external sources (SuperSet portal, email, official websites), how it is processed and stored in MongoDB collections, and how it is delivered to users. It also covers upsert logic to avoid duplicates, data lifecycle management, and how the schema supports real-time notifications and historical analysis.
The system is organized around a modular architecture:
Data ingestion from SuperSet, email, and official websites
Structuring and enrichment of notices and jobs
Storage in MongoDB collections
Notification dispatch across channels
Historical analytics and statistics computation
DBClient: Provides MongoDB connection and exposes typed collection handles.
DatabaseService: Implements CRUD and aggregation operations across collections, including notices, jobs, placement offers, users, policies, and official placement data.
SupersetClientService: Authenticates and fetches notices and jobs from SuperSet, with enrichment for detailed job information.
NoticeFormatterService: LLM-powered notice classification, job matching, extraction, and formatting for notifications.
UpdateRunner: Orchestrates fetching notices/jobs from SuperSet, deduplication, enrichment, and saving to DB.
PlacementService: Extracts placement offers from emails using a LangGraph pipeline and merges updates into PlacementOffers.
OfficialPlacementService: Scrapes official placement statistics from the JIIT website and stores normalized data.
NotificationRunner and NotificationService: Dispatch unsent notices to Telegram and Web Push channels, marking them as sent.
The system follows a pipeline architecture:
Ingestion: SuperSet notices/jobs, email placement offers, official website data
Processing: Structuring, enrichment, LLM-based formatting, deduplication
Storage: MongoDB collections with targeted upsert logic
Delivery: Real-time notifications to Telegram/Web Push
Analytics: Historical stats and reporting
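The five stages above can be sketched end to end. This is a minimal, illustrative pipeline over an in-memory store; all function names and fields here are hypothetical stand-ins, not the actual codebase API:

```python
def ingest():
    """Stage 1: pull raw items from external sources (stubbed)."""
    return [{"id": "n1", "title": "Drive: Acme Corp", "raw": True}]

def process(items):
    """Stage 2: structure/enrich each item (LLM formatting elided)."""
    return [{**item, "raw": False, "formatted": item["title"].upper()}
            for item in items]

def store(items, db):
    """Stage 3: write into an in-memory stand-in for MongoDB."""
    for item in items:
        db[item["id"]] = item

def deliver(db):
    """Stage 4: dispatch unsent items and mark them sent."""
    sent = []
    for item in db.values():
        if not item.get("sent"):
            sent.append(item["id"])   # channel broadcast would happen here
            item["sent"] = True
    return sent

db = {}
store(process(ingest()), db)
print(deliver(db))  # → ['n1']
print(deliver(db))  # → [] (already marked sent)
```

The key property the real pipeline shares with this sketch is that delivery is idempotent: a second run finds nothing unsent.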
Data Flow from External Sources to MongoDB#
SuperSet notices and jobs:
SupersetClientService authenticates and fetches notices and job listings.
UpdateRunner filters out items whose IDs already exist in the database, enriches only the new jobs, and passes new notices through NoticeFormatterService.
DatabaseService saves notices and upserts jobs.
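The dedup-before-enrichment step can be sketched as follows (function and field names are illustrative assumptions): because enrichment costs an extra fetch per job, it should run only for jobs not already stored.

```python
def select_new_jobs(fetched_jobs, existing_ids):
    """Return only the jobs whose IDs are not already stored, so
    enrichment (an extra request per job) runs only for new ones."""
    return [job for job in fetched_jobs if job["id"] not in existing_ids]

fetched = [{"id": "j1"}, {"id": "j2"}, {"id": "j3"}]
existing = {"j2"}
new_jobs = select_new_jobs(fetched, existing)
print([j["id"] for j in new_jobs])  # → ['j1', 'j3']
```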
Email-based placement offers:
PlacementService runs a LangGraph pipeline to classify, extract, validate, sanitize, and format placement offers.
DatabaseService merges updates into PlacementOffers with deduplication by company and student records.
Official website statistics:
OfficialPlacementService scrapes and normalizes data, then DatabaseService inserts or updates OfficialPlacementData using content hash to detect changes.
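The change-detection idea is to hash only the stable content, excluding volatile fields, so a re-scrape of an unchanged page produces the same hash and skips a full write. A minimal sketch (the exact excluded fields and hash algorithm are assumptions):

```python
import hashlib
import json

def content_hash(doc, exclude=("scrape_timestamp", "content_hash")):
    """Hash the document with volatile fields removed; identical
    content from different scrape runs yields an identical hash."""
    stable = {k: v for k, v in doc.items() if k not in exclude}
    payload = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = {"year": 2024, "placed": 1200, "scrape_timestamp": "2024-06-01"}
b = {"year": 2024, "placed": 1200, "scrape_timestamp": "2024-06-02"}
print(content_hash(a) == content_hash(b))  # → True (only timestamp differs)
```

`sort_keys=True` matters: without it, two dicts with the same content could serialize in different key orders and hash differently.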
Collection Relationships and Data Modeling#
Notices: Stores formatted notices with sent status flags for channels.
Jobs: Structured job listings with eligibility, location, package, and hiring flow.
PlacementOffers: Company-wise placement offers with roles, students, and package details.
Users: User profiles and subscription status for notifications.
Policies: Academic placement policies by year.
OfficialPlacementData: Normalized official statistics with content hash for change detection.
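The relationships between these collections can be illustrated with representative document shapes. The field names below are assumptions for illustration, not the exact schema; the point is how documents reference one another:

```python
# Notices reference Jobs by ID; PlacementOffers are keyed by company.
notice = {
    "id": "notice-42",
    "job_id": "job-7",            # link into the Jobs collection
    "sent_telegram": False,       # per-channel sent-status flags
    "sent_webpush": False,
}
job = {
    "id": "job-7",
    "company": "Acme Corp",
    "eligibility": {"min_cgpa": 7.0},
    "package_lpa": 12.5,
}
offer = {
    "company": "Acme Corp",       # merge key for PlacementOffers
    "roles": [{"title": "SDE", "package_lpa": 12.5}],
    "students": [{"enrollment": "21001"}],
}
assert notice["job_id"] == job["id"]
assert offer["company"] == job["company"]
```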
Upsert Logic and Duplicate Prevention#
Notices: Insert-only by ID; existence checked before insert to prevent duplicates.
Jobs: Upsert by ID; if existing, replace with merged fields and updated timestamp.
PlacementOffers: Merge by company; roles and students merged with deduplication and package prioritization; emits events for new offers and updates.
OfficialPlacementData: Hash-based deduplication; content hash computed excluding timestamps and previous hash; unchanged content updates timestamp only.
Users: Upsert by user_id; activate/deactivate logic for soft deletion.
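The two main write patterns (insert-only for notices, merge-upsert for jobs) can be sketched against an in-memory stand-in for the collections. In MongoDB this would be `update_one(..., upsert=True)` and an existence check before `insert_one`; the helper names below are hypothetical:

```python
import time

def upsert_job(store, job):
    """Upsert by ID: merge incoming fields over the existing
    document and refresh the updated timestamp."""
    existing = store.get(job["id"], {})
    store[job["id"]] = {**existing, **job, "updated_at": time.time()}
    return "updated" if existing else "inserted"

def save_notice(store, notice):
    """Insert-only by ID: skip if the notice already exists."""
    if notice["id"] in store:
        return "skipped"
    store[notice["id"]] = notice
    return "inserted"

jobs, notices = {}, {}
print(upsert_job(jobs, {"id": "j1", "company": "Acme"}))   # → inserted
print(upsert_job(jobs, {"id": "j1", "package_lpa": 12}))   # → updated
print(save_notice(notices, {"id": "n1"}))                  # → inserted
print(save_notice(notices, {"id": "n1"}))                  # → skipped
print(jobs["j1"]["company"])  # → 'Acme' (fields merged, not replaced)
```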
Data Lifecycle Management#
Creation: Notices and jobs created via ingestion; placement offers created/updated via email processing; official stats created via scraping.
Updates: Jobs and placement offers are upserted; notices marked as sent after delivery; users activated/deactivated.
Archiving/Cleanup: TTL indexes recommended for logs and temporary data; unsent notices retained for retry; official stats stored with timestamps for historical analysis.
Real-time Delivery and Historical Analysis#
Real-time: NotificationService fetches unsent notices and broadcasts to Telegram/Web Push; marks as sent upon successful delivery.
Historical: Placement statistics computed from PlacementOffers; official stats aggregated from OfficialPlacementData; notices archived with timestamps for audit.
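The mark-as-sent-on-success rule is what gives unsent notices their retry semantics: a notice that fails to broadcast stays unsent and is picked up again on the next run. A minimal sketch with a pluggable broadcast callback (names hypothetical):

```python
def dispatch_unsent(notices, broadcast):
    """Send every unsent notice; mark it sent only on success,
    so failures remain queued for the next run."""
    delivered = []
    for n in notices:
        if n.get("sent"):
            continue
        if broadcast(n):          # e.g. Telegram / Web Push fan-out
            n["sent"] = True
            delivered.append(n["id"])
    return delivered

queue = [{"id": "n1"}, {"id": "n2"}, {"id": "n3"}]
flaky = lambda n: n["id"] != "n2"              # simulate one failure
print(dispatch_unsent(queue, flaky))           # → ['n1', 'n3']
print(dispatch_unsent(queue, lambda n: True))  # → ['n2'] (retried)
```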
The system exhibits low coupling and high cohesion:
DBClient encapsulates MongoDB connectivity and collection exposure.
DatabaseService centralizes all DB operations and maintains clear separation of concerns.
SupersetClientService and OfficialPlacementService are specialized for ingestion.
NoticeFormatterService and PlacementService encapsulate LLM pipelines for processing.
NotificationRunner and NotificationService orchestrate delivery.
Efficient queries: Use targeted projections and limits; leverage indexes on frequently queried fields (e.g., createdAt, updatedAt, id).
Batch operations: InsertMany for bulk notices/jobs; minimize round-trips.
Hash-based deduplication: OfficialPlacementData reduces unnecessary writes.
TTL indexes: Automatically expire logs and temporary data.
Connection pooling: Leverage PyMongo’s pooled connections.
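For the batch-operations point, a common pattern is to chunk documents before calling `insert_many` so each round-trip carries a bounded payload. A small helper (not from the codebase) illustrates the chunking:

```python
def chunked(items, size):
    """Yield fixed-size batches so bulk writes (e.g. insert_many)
    stay within a bounded payload per round-trip."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

docs = [{"id": f"n{i}"} for i in range(7)]
batches = list(chunked(docs, 3))
print([len(b) for b in batches])  # → [3, 3, 1]
```

Each batch would then be passed to a single `collection.insert_many(batch)` call instead of seven `insert_one` round-trips.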
Common issues and resolutions:
MongoDB connection failures: Verify MONGO_CONNECTION_STR; check DBClient initialization and ping command.
Duplicate notices: Confirm notice_exists check and ID uniqueness; review save_notice logic.
Missing job details: Ensure enrich_jobs is called for new jobs; verify SupersetClientService enrichment flow.
Placement offer conflicts: Review merge logic for roles and students; confirm package prioritization rules.
Notification delivery failures: Inspect NotificationService broadcast results; verify channel configurations.
The system integrates external data sources, processes and structures content, and persists it in MongoDB collections with robust upsert logic to prevent duplicates. It supports real-time notifications and historical analytics through dedicated collections and aggregation functions. Clear separation of concerns and modular components enable maintainability and scalability.
Appendix A: Representative Data Samples#
Structured job listings sample: app/data/structured_job_listings.json
Placement offers sample: app/data/placement_offers.json